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Abstract 

A typical goal of supervised dimension reduction is to find a low-dimensional 
subspace of the input space such that the projected input variables preserve max¬ 
imal information about the output variables. The dependence maximization ap¬ 
proach solves the supervised dimension reduction problem through maximizing a 
statistical dependence between projected input variables and output variables. A 
well-known statistical dependence measure is mutual information (MI) which is 
based on the Kullback-Leibler (KL) divergence. However, it is known that the KL 
divergence is sensitive to outliers. On the other hand, quadratic MI (QMI) is a 
variant of MI based on the L2 distance which is more robust against outliers than 
the KL divergence, and a computationally efficient method to estimate QMI from 
data, called least-squares QMI (LSQMI), has been proposed recently. For these 
reasons, developing a supervised dimension reduction method based on LSQMI 
seems promising. However, not QMI itself, but the derivative of QMI is needed for 
subspace search in supervised dimension reduction, and the derivative of an accu¬ 
rate QMI estimator is not necessarily a good estimator of the derivative of QMI. In 
this paper, we propose to directly estimate the derivative of QMI without estimat¬ 
ing QMI itself. We show that the direct estimation of the derivative of QMI is more 
accurate than the derivative of the estimated QMI. Finally, we develop a supervised 
dimension reduction algorithm which efficiently uses the proposed derivative es¬ 
timator, and demonstrate through experiments that the proposed method is more 
robust against outliers than existing methods. 


1 Introduction 

Supervised learning is one of the eentral problems in maehine learning whieh aims 
at learning an input-output relationship from given input-output paired data samples. 
Although many methods were proposed to perform supervised learning, they often work 
poorly when the input variables have high dimensionality. Sueh a situation is eommonly 
referred to as the curse of dimensionality (Bishop, 2006), and a eommon approaeh to 
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mitigate the curse of dimensionality is to preprocess the input variables by dimension 
reduction (Burges, 2010). 

A typical goal of dimension reduction in supervised learning is to find a low¬ 
dimensional subspace of the input space such that the projected input variables preserve 
maximal information about the output variables. Thus, a successive supervised learn¬ 
ing method can use the low-dimensional projection of the input variables to learn the 
input-output relationship with a minimal loss of information. The purpose of this paper 
is to develop a novel supervised dimension reduction method. 

The dependence maximization approach solves the supervised dimension reduction 
problem through maximizing a statistical dependence measure between projected input 
variables and output variables. Mutual information (MI) is a well-known tool for mea¬ 
suring statistical dependency between random variables (Cover and Thomas, 1991). MI 
is well-studied and many methods were proposed to estimate MI from data. A notable 
method is the maximum likelihood Ml (MLMI) (Suzuki et ah, 2008), which does not 
require any assumption on the data distribution and can perform model selection via 
cross-validation. For these reasons, MLMI seems to be an appealing tool for supervised 
dimension reduction. However, MI is defined based on the Kullback-Leibler divergence 
(Kullback and Leibler, 1951), which is known to be sensitive to outliers (Basu et ah, 
1998). Hence, MI is not an appropriate tool when it is applied on a dataset containing 
outliers. 

Quadratic MI (QMI) is a variant of MI (Principe et ah, 2000). Unlike MI, QMI is 
defined based on the L 2 distance. A notable advantage of the L 2 distance over the KL 
divergence is that the L 2 distance is more robust against outliers (Basu et ah, 1998). 
Moreover, a computationally efficient method to estimate QMI from data, called least- 
squares QMI (LSQMI) (Sainui and Sugiyama, 2013), was proposed recently. LSQMI 
does not require any assumption on the data distribution and it can perform model se¬ 
lection via cross-validation. For these reasons, developing a supervised dimension re¬ 
duction method based on LSQMI is more promising. 

An approach to use LSQMI for supervised dimension reduction is to firstly esti¬ 
mate QMI between projected input variables and output variables by LSQMI, and then 
search for a subspace which maximizes the estimated QMI by a nonlinear optimization 
method such as gradient ascent. However, the essential quantity of the subspace search 
is the derivative of QMI w.r.t. the subspace, not QMI itself. Thus, LSQMI may not be 
an appropriate tool for developing supervised dimension reduction methods since the 
derivative of an accurate QMI estimator is not necessarily an accurate estimator of the 
derivative of QMI. 

To cope with the above problem, in this paper, we propose a novel method to directly 
estimate the derivative of QMI without estimating QMI itself. The proposed method 
has the following advantageous properties: it does not require any assumption on the 
data distribution, the estimator can be computed analytically, and the tuning parameters 
can be objectively chosen by cross-validation. We show through experiments that the 
proposed direct estimator of the derivative of QMI is more accurate than the derivative 
of the estimated QMI. Then we develop a fixed-point iteration which efficiently uses the 
proposed estimator of the derivative of QMI to perform supervised dimension reduction. 
Finally, we demonstrate the usefulness of the proposed supervised dimension reduction 
method through experiments and show that the proposed method is more robust against 
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outliers than existing methods. 

The organization of this paper is as follows. We firstly formulate the supervised 
dimension reduction problem and review some of existing methods in Section 2, Then 
we give an overview of QMI and review some of QMI estimators in Section 3. The 
details of the proposed derivative estimator are given in Section 4. Then in Section 5 we 
develop a supervised dimension reduction algorithm based on the proposed derivative 
estimator. The experiment results are given in Section 6. The conclusion of this paper 
is given in Section 7. 

2 Supervised Dimension Reduction 

In this section, we firstly formulate the supervised dimension reduction problem. Then 
we briefly review existing supervised dimension reduction methods and discuss their 
problems. 

2.1 Problem Formulation 

Let Dx C and Vy C be the input domain and output domain with dimension¬ 
ality dx and dy, respectively, and p{x, y) be a joint probability density on "Dx x T^y 
Firstly, assume that we are given an input-output paired data set V = {(a;*, 
where each data sample is drawn independently from the joint density: 


i.i.d. 




p{x,y). 


n 


Next, let W G {W G be an orthonormal matrix with a 

known constant < dx, where Id^ denotes the dz-by-d^ identity matrix and ^ denotes 
the matrix transpose. Then assume that there exists a dz-dimensional subspace in 
spanned by the rows of W such that the projection of x onto this subspace denoted 
by z = W X preserves the maximal information about y of x. That is, we can sub¬ 
stitute X hy z with a minimal loss of information about y. We refer to the problem 
of estimating W from the given data as supervised dimension reduction. Below, we 
review some of the existing supervised dimension reduction methods. 

2.2 Sliced Inverse Regression 

Sliced inverse regression (SIR) (Li, 1991) is a well known supervised dimension reduc¬ 
tion method. SIR formulates supervised dimension reduction as a problem of finding 
W which makes x and y conditionally independent given z: 


{x ±y)\ z. 

The key principal of SIR lies on the following equality 

d-2, 


( 1 ) 



( 2 ) 


^For simplicity, we assume that x is standardized so that E[a;] = 0 and E[a;a;^] = I 
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where E denotes the eonditional expeetation and Wi denotes the i-th row of W. The 
importanee of this equality is that if the equality holds for any c G and some eon- 
stants oo, ai,..., then the inverse regression eurve E[ic||/] lies on the space spanned 
by W which satisfies Eq.(l). Based on this fact, SIR estimates W as follows. First, 
the range of y is sliced into multiple slices. Then E[a^||/] is estimated as the mean of x 
for each slice of y. Finally, W is obtained as the largest principal components of the 
covariance matrix of the means. 

The significant advantages of SIR are its simplicity and scalability to large datasets. 
However, SIR relies on the equality in Fq.(2) which typically requires that p{x) is an 
elliptically symmetric distribution such as Gaussian. This is restrictive and thus the 
practical usefulness of SIR is limited. 

2.3 Minimum Average Variance Estimation based on the Condi¬ 
tional Density Functions 

The minimum average variance estimation based on the conditional density functions 
(dMAVF) (Xia, 2007) is a supervised dimension method which does not require any 
assumption on the data distribution and is more practical compared to SIR. Briefly 
speaking, dMAVF aims to find a matrix W which yields an accurate non-parametric 
estimation of the conditional density p{y\z). 

The essential part of dMAVE is the following model: 

Hb{y-y) = mb{z,y) +eb{y\z), 

where H}, denotes a symmetric kernel function with bandwidth b > 0, mb{z,y) de¬ 
notes a conditional expectation of Hh{y — y) given 2 , and eb{y\z) = Hh{y — y) — 
¥.[Hb{y — y)\z\ with ¥,[eb{y\z)] = 0. An important property of this model is that 
mb{z,y) —)■ p{y\z) when fe —)■ 0 as n —)■ cxo. Then, dMAVF estimates mb{z,y) by a 
local linear smoother (Fan et ah, 1996). More specifically, a local linear smoother of 
mb{zi,yk) is given by 


mb{Zi,yk) ^ mb{zj,yk) + _ zf 

= ttjk -f {Xi -Xj), (3) 

where Zj is an arbitrary point close to Zj, and ajk G M and bjk G are parame¬ 
ters. Based on this local linear smoother, dMAVF solves the following minimization 
problem: 


It/ IL 

min — p{xj,yk) Y [^biyi - yk) - ajk - bJ^W {xi - xf] Kh{xi, xf, 

W Ti . 

j,k=l 1=1 

(4) 

where Kh is a symmetric kernel function with bandwidth h > 0. The function p{x, y) 
is a trimming function which is evaluated as zero when the densities oi x ox y are 
lower than some threshold. A solution to this minimization problem is obtained by 
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alternatively solving quadratie programming problems for W, and (ajk, bjk) until eon- 
vergenee. 

The main advantage of dMAVE is that it does not require any assumption on the 
data distribution. However, a significant disadvantage of dMAVE is that there is no 
systematic method to choose the kernel bandwidths and the trimming threshold. In 
practice, dMAVE uses a bandwidth selection method based on the normal-reference 
rule of the non-parametric conditional density estimation (Silverman, 1986; Ean et ah, 
1996), and a fixed trimming threshold. Although this model selection strategy works 
reasonably well in general, it does not always guarantee good performance on all kind 
of datasets. 

Another disadvantage of dMAVE is that the optimization problem in Eq.(4) may 
have many local solutions. To cope with this problem, dMAVE proposed to use a su¬ 
pervised dimension reduction method called the outer product of gradient based on con¬ 
ditional density functions (dOPG) (Xia, 2007) to obtain a good initial solution. Thus, 
dMAVE may not perform well if dOPG fails to provide a good initial solution. 

2.4 Kernel Dimension Reduction 

Another supervised dimension reduction method which does not require any assump¬ 
tion on the data distribution is kernel dimension reduction (KDR) (Eukumizu et ah, 
2009). Unlike dMAVE which focuses on the conditional density, KDR aims to find a 
matrix W which satisfies the conditional independence in Eq.(l). The key idea of KDR 
is to evaluate the conditional independence through a conditional covariance operator 
over reproducing kernel Hilbert spaces (RKHSs) (Aronszajn, 1950). 

Throughout this subsection, we use ("Hz, kz) to denote an RKHS of functions on the 
domain Dz equipped with reproducing kernel kz'. 


{f,kz{-,z))^^ = f(z), 


for f E T-Lz and z E Vz- The RKHSs of functions on domains Dx and Vy are also 
defined similarly as ("Hx, kf) and {diy, /Cy), respectively. The cross-covariance operator 
T^yz '■ diz diy satisfies the following equality for all f E Hz and g E Hy: 


{g, ^Yzf)ny = [f{z)g{y)] - [f{z)] E^ [g{y)], 


where E^^, E^, and E^ denotes expectations over densities p{z,y), p{z), and p{y), 
respectively. Then, the conditional covariance operator can be defined using cross¬ 
covariance operators as 



(5) 


where it is assumed that always exists. The importance of the conditional covari¬ 
ance operator in supervised dimension reduction lies in the following relations: 


'^YY\Z > 

where the inequality refers to the partial order of self-adjoint operators, and 


( 6 ) 



( 7 ) 
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These relations mean that the eonditional independence can be achieved by finding a 
matrix W which minimizes I^yviz in the partial order of self-adjoint operators. Based 
on this fact, KDR solves the following minimization problem: 


Tr[GY(Gz + AnInr'], 


( 8 ) 


mm 

W&{W\WW^=IaJ 


where A„ denotes a regularization parameter, Gz and Gy denotes centered Gram ma¬ 
trices with the kernels and ky, respectively, and Tr [•] denotes the trace of an operator. 
A solution to this minimization problem is obtained by a gradient descent method. 

KDR does not require any assumption on the data distribution and was shown to 
work well on various regression and classification tasks (Fukumizu et ah, 2009). How¬ 
ever, KDR has two disadvantages in practice. The first disadvantage of KDR is that even 
though the kernel parameters and the regularization parameter can heavily affect the per¬ 
formance, there seems to be no justifiable model selection method to choose these pa¬ 
rameters so far. Although it is always possible to choose these tuning parameters based 
on a criterion of a successive supervised learning method with cross-validation, this 
approach results in a nested loop of model selection for both KDR itself and the succes¬ 
sive supervised learning method. Moreover, this approach makes supervised dimension 
reduction depends on the successive supervised learning method which is unfavorable 
in practice. 

The second disadvantage is that the optimization problem in Eq.(8) is non-convex 
and may have many local solutions. Thus, if the initial solution is not properly cho¬ 
sen, the performance of KDR may be unreliable. A simple approach to cope with this 
problem is to choose the best solution with cross-validation based on the successive 
supervised learning method, but this approach makes supervised dimension reduction 
depends on the successive supervised learning method and is unfavorable. A more so¬ 
phisticated approach was considered in Fukumizu and Leng (2014) which proposed 
to use a solution of a supervised dimension reduction method called gradient-based 
kernel dimension reduction (gKDR) as an initial solution for KDR. However, it is not 
guarantee that gKDR always provide a good initial solution for KDR. 

2.5 Least-Squares Dimension Reduction 

The least-squares dimension reduction (LSDR) (Suzuki and Sugiyama, 2013) is another 
supervised dimension reduction method which does not require any assumption on the 
data distribution. Similarly to KDR, LSDR aims to find a matrix W which satisfies the 
conditional independence in Eq.(l). However, instead of the conditional covariance op¬ 
erators, LSDR evaluates the conditional independence through a statistical dependence 
measure. 

LSDR utilizes a statistical dependence measure called squared-loss mutual infor¬ 
mation (SMI). SMI between random variables z and y is defined as 



(9) 


SMI(Z, Y) is always non-negative and equals to zero if and only if z and y are sta¬ 
tistically independent, i.e., p{z,y) = p{z)p{y). The important properties of SMI in 
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supervised dimension reduetion are the following relations: 


SMI(Z,F) < SMI(X,F), 

and 


SMI(Z, Y) = SMI(X, Y) (a? X 2/) I 2. 


Thus, the eonditional independence can be achieved by finding a matrix W which 
maximizes SMI(Z, Y). Since SMI(Z, Y) is typically unknown, it is estimated by the 


least-squares mutual information (Suzuki et ah, 2009) method which directly estimates 


the density ratio 


p{z,y) 


p{z)p[y) 

the following maximization problem: 


without performing any density estimation. Then, LSDR solves 


max SMI(Z,F), 

W&{W\WW^=Ia^} 


( 10 ) 


where SMI(Z, Y) denotes the estimated SMI. The solution to this maximization prob¬ 
lem is obtained by a gradient ascent method. Note that this maximization problem is 
non-convex and may have many local solutions. 

LSDR does not require any assumption on the data distribution, similarly to dMAVE 
and KDR. However, the significant advantage of LSDR over dMAVE and KDR is that 
ESDR can perform model selection via cross-validation and avoid a poor local solu¬ 
tion without requiring any successive supervised learning method. This is a favorable 
property as a supervised dimension reduction method. 

However, a disadvantage of ESDR is that the density ratio function p^z)p{y) 
be highly fluctuated, especially when the data contains outliers. Since it is typically 
difficult to accurately estimate a highly fluctuated function, ESDR could be unreliable 
in the presence of outliers. 

Next, we consider a supervised dimension reduction approach based on quadratic 
mutual information which can overcome the disadvantages of the existing methods. 


3 Quadratic Mutual Information 

In this section, we briefly introduce quadratic mutual information and discuss how it 
can be used to perform robust supervised dimension reduction. 


3.1 Quadratic Mutual Information and Mutual Information 

Quadratic mutual information (QMI) is a measure for statistical dependency between 
random variables (Principe et ah, 2000), and is defined as 


QMl{Z,Y) = ^ 


{p{z,y) - p{z)p{y)f dzdy. 


( 11 ) 


QMI(Z, Y) is always non-negative and equals to zero if and only if 2 and y are statis¬ 
tically independent, i.e., p{z, y) = p{z)p{y). Such a property of QMI is similar to that 
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of the ordinary mutual information (MI), which is defined as 

MI(Z, y) = JJ y) log ( 12 ) 

The essential difference between QMI and MI is the discrepancy measure. QMI(Z, Y) 
is the L 2 distance between p{z,y) and p{z)p{y), while M\{Z,Y) is the Kullback- 
Leibler (KL) divergence (Kullback and Leibler, 1951). 

MI has been studied and applied to many data analysis tasks (Cover and Thomas, 
1991). Moreover, an efficient method to estimate MI from data is also available (Suzuki 
et ah, 2008). However, MI is not always the optimal choice for measuring statistical 
dependence because it is not robust against outliers. An intuitive explanation is that MI 
contains the log function and the density ratio: the value of logarithm can be highly 
sharp near zero, and density ratio can be highly fluctuated and diverge to infinity. Thus, 
the value of MI tends to be unstable and unreliable in the presence of outliers. In 
contrast, QMI does not contain the log function and the density ratio, and thus QMI 
should be more robust against outliers than MI. 

Another explanation of the robustness of QMI and MI can be understood based 
on their discrepancy measures. Both L 2 distance (QMI) and KL divergence (MI) can 
be regarded as members of a more general divergence class called the density power 
divergence (Basu et ah, 1998): 

DPa(p||g) = j (^p{xY^°‘ - (^1 + ^ p{x)q{xY + ^ 

where a > 0. Based on this divergence class, the L 2 distance and the KL divergence 
can be obtained by setting a = 1 and a ^ 0, respectively. As discussed in Basu et al. 
(1998), the parameter a controls the robustness against outliers of the divergence, where 
a large value of a indicates high robustness. This means that the L 2 distance (a = 1) is 
more robust against outliers than the KL divergence (a 0). 

In supervised dimension reduction, robustness against outliers is an important re¬ 
quirement because outliers often make supervised dimension reduction methods to 
work poorly. Thus, developing a supervised dimension reduction method based on 
QMI is an attractive approach since QMI is robust against outliers. This QMI-based 
supervised dimension reduction method is performed by finding a matrix W* which 
maximizes QMI(Z, Y): 


q{x\ 




da;, (13) 


W* = argmax QMI(Z,y). 
we{w\ww^=id.Y 

The motivation is that, if QMI(Z, Y) is maximized then z and y are maximally depen¬ 
dent on each other, and thus we may disregard x with a minimal loss of information 
about y. 

Since QMI(Z, Y) is typically unknown, it needs to be estimated from data. Below, 
we review existing QMI estimation methods and then discuss a weakness of performing 
supervised dimension reduction using these QMI estimation methods. 
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3.2 Existing QMI Estimation Methods 

We review two QMI estimation methods whieh estimate QMI(Z, Y) from the given 
data. The first method estimates QMI through density estimation, and the seeond 
method estimates QMI through density differenee estimation. 

3.2.1 QMI Estimator based on Density Estimation 

Expanding Eq.(Il) allows us to express QMI(Z, Y) as 



QMI(Z,y) = l 


A naive approach to estimate QMI(Z, Y) is to separately estimate the unknown den¬ 
sities p{z,y), p{z), and p{y) by density estimation methods such as kernel density 
estimation (KDE) (Silverman, 1986), and then plug the estimates into Eq.(14). 

Eollowing this approach, the KDE-based QMI estimator has been studied and ap¬ 
plied to many problems such diS feature extraction for classification (Torkkola, 2003; 
Principe et ah, 2000), blind source separation (Principe et ah, 2000), and image regis¬ 
tration (Atif et ah, 2003). Although this density estimation based approach was shown 
to work well, accurately estimating densities for high-dimensional data is known to be 
one of the most challenging tasks (Vapnik, 1998). Moreover, the densities contained in 
Eq.(14) are estimated independently without regarding the accuracy of the QMI estima¬ 
tor. Thus, even if each density is accurately estimated, the QMI estimator obtained from 
these density estimates does not necessarily give an accurate QMI. An approach to mit¬ 
igate this problem is to consider density estimators which their combination minimizes 
the estimation error of QMI. Although this approach shows better performance than the 
independent density estimation approach, it still performs poorly in high-dimensional 
problems (Sugiyama et ah, 2013). 

3.2.2 Least-Squares QMI 

To avoid the separate density estimation, an alternative method called least-squares 
QMI (ESQMI) (Sainui and Sugiyama, 2013) was proposed. Below, we briefly review 
the ESQMI method. 

Eirst, notice that QMI(Z, Y) can be expressed in term of the density difference as 



(15) 


where 


f{z,y) =p{z,y) -p{z)p{y). 


The key idea of ESQMI is to directly estimate the density difference f{z,y) without 
going through any density estimation by the procedure of the least-squares density dif¬ 
ference (Sugiyama et ah, 2013). Eetting d{z, y) be a model of the density difference. 
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LSQMI learns d{z, y) so that it is fitted to the true density differenee under the squared 
loss: 


{d{z,y) - f{z,y)f dzdy. 


By expanding the integrand, we obtain 

1 
2 


JJ d{z,y)^dzdy - JJ d{z,y)f{z,y)dzdy + ^ jj f{z,yfdzdy. 


Since the last term is a constant w.r.t. the model d{z^ y), we omit it and obtain the 
following criterion: 


d{z,yfdzdy- // d{z,y)f{z,y)dzdy. 


(16) 


Then, the density difference estimator d{z, y) is obtained as the solution of the follow¬ 
ing minimization problem: 


d = argmin 


d{z,yfdzdy - // d{z,y)f{z,y)dzdy 


(17) 


The solution of the minimization problem in Eq.(17) depends on the choice of the 
model d{z, y). LSQMI employs the following linear-in-parameter model 


d{z,y) = cx -ip{z,y), 

where ck is a parameter vector and 'ily{z, y) is a basis function vector. For this model, 
finding the solution of Eq.(17) is equivalent to solving 


mm 


—cx^ Dot — cx^ q 
2 ^ 


where 


D 


'ip{z,y)'il}{z,y) dzdy, 


(18) 


^ = ff y)f{^^ y)^zdy 

= JJ 'Ip{z,y)p{z,y)dzdy - JJ 'ip{z,y)p{z)p{y)dzdy. (19) 

By approximating the expectation over the densities p(z, y), p{z), and p{y) with sam¬ 
ple averages, we obtain the following empirical minimization problem 


mm 

a 


—cxd'Dcx — cx^q 
2 ^ 


where q is the sample approximation of Eq.(19): 


^ IL ^ IL 

Q = - Y] yi)- — Y yj)- 

1=1 ij=i 
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By including the L 2 regularization term, we obtain 


a = argmin 

a 


1 A T ■ 

-CK Da. — a o H —a a 

2 2 




where A > 0 is the regularization parameter. Then, the solution is obtained analytically 
as 


a = {D + \I)~^q. (20) 

Therefore, the density difference estimator is obtained as 

d{z,y) = a~^'il){z,y). 

Finally, QMI estimator is obtained by substituting the density difference estimator into 
Eq.(15). A direet substitution yields two possible QMI estimators: 

^l{Z,Y) = ^a~^q, (21) 

qMl{Z,Y) = ^a^Da. (22) 

However, it was shown in Sugiyama et al. (2013) that a linear combination of the two 
estimators defined as 

C® (Z, Y) = a^q - ^a^Da, (23) 

provides smaller bias and is a more appropriate QMI estimator. 

As shown above, LSQMI avoids multiple-step density estimation by direetly es¬ 
timating the density difference eontained in QMI. It was shown that such direct esti¬ 
mation proeedure tends to be more accurate than multiple-step estimation (Sugiyama 
et al., 2013). Moreover, LSQMI is able to objectively choose the tuning parameter 
eontained in the basis funetion y) and the regularization parameter A based on 
eross-validation. This property allows LSQMI to solve ehallenging tasks sueh as clus¬ 
tering (Sainui and Sugiyama, 2013) and unsupervised dimension reduction (Sainui and 
Sugiyama, 2014) in an objeetive way. 


3.3 Supervised Dimension Reduction via LSQMI 

Given an effieient QMI estimation method sueh as LSQMI, supervised dimension re¬ 
duction can be performed by finding a matrix W* defined as 

W* = argmax (5vn(Z,y). (24) 

we{w\ww^=id.Z 


A straightforward approaeh to solving Eq.(24) is to perform the gradient aseent: 


W ^W + t 


d^^{Z,Y) 

aw 


1 
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where t > 0 denotes the step size. The update formula means that the essential point 
of the QMI-based supervised dimension reduction method is not the accuracy of the 
QMI estimator, but the accuracy of the estimator of the derivative of the QMI. Thus, 
the existing LSQMI-based approach which first estimates QMI and then compute the 
derivatives of the QMI estimator is not necessarily appropriate since an accurate esti¬ 
mator of QMI does not necessarily mean that its derivative is an accurate estimator of 
the derivative of QMI. Next, we describe our proposed method which overcomes this 
problem. 


4 Derivative of Quadratic Mutual Information 

To cope with the weakness of the QMI estimation methods when performing supervised 
dimension reduction, we propose to directly estimate the derivative of QMI without 
estimating QMI itself. 


4.1 Direct Estimation of the Derivative of Quadratic Mutual Infor¬ 
mation 


From Eq.(15), the derivative of the QMI(Z, Y) w.r.t. the (f, f')-th element of W can be 
expressed by ^ 


aQMI(W) d 


dW, 


ti' 


dW, 


11' 


f{z,yfdzdy 


'^"^ dzdy 


dW, 


i,i' 


df{z,y)^ dz 


f{z,y) 


dz dW, 


-dzdy 


ey 


df{z,y)^ dz 


P{z,y) 


dz dWt 


-dzdy 


t,£' 


df{z,yy dz 


p{z)p{y) 


dz 


dWt 


-dzdy, 


(25) 


11' 


where in the second line we assume that the order of the derivative and the integration 
is interchangeable. By approximating the expectations over the densities p(z, y), p(z), 
and p(y) with sample averages, we obtain an approximation of the derivative of QMI 
as 


aQMI(lV) 


dWc 


11' 


E 

i=\ 


< 9 /( 2 *, dzi 


dz 


dWc 


ly 


-E 


df(zi,y,y dz 




dz 


dW, 


(26) 


ei' 


Note that since ^ we have that is the -dimensional vector 

with zero everywhere except at the £-th dimension which has value Hence, Eq.(26) 

^Throughout this section, we use QMI(W) instead of QMI(Z, F) when we consider its derivative 
for notational convenience. However, they still represent the QMI between random variables 2 : and y. 
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can be simplified as 


^QMI(Vr) _ dfjzj, y^) (£/) ^ df{zi, yj) (^/) 

dWee dz(^) ^ dz(^) ^ ’ 

i=l i,j=l 

This means that the derivative of QMI(Z, Y) w.r.t. W can be obtained once we know 
the derivatives of the density difference w.r.t. z^^'> for all £ G {1,..., dz}- However, 
these derivatives are often unknown and need to be estimated from data. Below, we first 
discuss existing approaches and their drawbacks. Then we propose our approach which 
can overcome the drawbacks. 

4.2 Existing Approaches to Estimate the Derivative of the Density 
Difference 

Our current goal is to obtain the derivative of the density difference w.r.t. z^^^ which can 
be rewritten as 


df{z,y) _ dp{z,y) dp{z) 


(28) 


All terms in Eq.(28) are unknown in practice and need to be estimated from data. There 
are three existing approaches to estimate them. 


(A) Density estimation 

Separately estimate the densities p{z, y), p{z), and p{y) by, e.g., kernel density 
estimation. Then estimate the right-hand side of Eq.(28) as 

y) _ 

dzd) dzd) 

where p{z, y), p{z), and p{y) denote the estimated densities. 

(B) Density derivative estimation 

Estimate the density p{y) by e.g., kernel density estimation. Next, separately 
estimate the densities derivative and by, e.g., the method of mean 

integrated square error for derivatives (Sasaki et ah, 2015), which can estimate 
the density derivative without estimating the density itself. Then estimate the 
right-hand side of Eq.(28) as 


y) _ X 

dzd) dzd) 

where p{y) denotes the estimated density, and and denote the (di¬ 

rectly) estimated density derivatives. 

(C) Density difference estimation 

Estimate the density difference /(z, y) by e.g., least-squares density difference 
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(Sugiyama et al., 2013), which can estimate the density differenee without es¬ 
timating the densities themselves. Then estimate the left-hand side of Eq.(28) 
as 


df{z,y) 

where /(z, y) denotes the (direetly) estimated density differenee. 

The problem of approaches (A) and (B) is that they involve multiple estimation steps 
where some quantities are estimated first and then they are plugged into Eq.(28). Sueh 
multiple-step methods are not appropriate sinee eaeh estimated quantity is obtained 
without regarding the others and the sueeeeding plug-in step using these estimates ean 
magnify the estimation error eontained in eaeh estimated quantity. 

On the other hand, approaeh (C) seems more promising than the previous two ap- 
proaehes sinee there is only one estimated quantity f{z, y). However, it is still not the 
optimal approaeh due to the faet that an aeeurate estimator of the density differenee 
does not neeessarily means that its derivative is an aeeurate estimator of the derivative 
of the density difference. 

To avoid the above problems, we propose a new approach which directly estimates 
the derivative of the density differenee. 

4.3 Direct Estimation of the Derivative of the Density Difference 

We propose to estimate the derivative of the density differenee w.r.t. using a model 

9£{z,y): 


df{z,y) 

dzi^) 


~ gi{z,y). 


The model gi{z, y) is learned so that it is fitted to its eorresponding derivative under the 
square loss: 


Ijj dzdy. (29) 

By expanding the square, we obtain 

\jj gi{z,yfdzdy- jj g,{z,y)^^^^dzdy + ^ jj dzdy. 

Sinee the last term is a eonstant w.r.t. the model gi{z, y), we omit it and obtain the 
following eriterion: 

\ jj 9 e{z,yfdzdy- jj gi{z,y) ^j^^^j dzdy. (30) 
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The second term is intractable due to the unknown derivative of the density difference. 
To make this term tractable, we use integration by parts (Kasube, 1983) to obtain the 
following: 


\gA^, y)/(2. yMlIo.-oo d^vrodj/ 






i ^df{z,y) 

9 e{z, y) dzdy, 


(31) 


where J ■dz\^^(e) denotes an integration over z except for the £-th element. Here, we 
require 


[ge{z, y)f{z, y)]l\t=-oc = (32) 

which is a mild assumption since the tails of the density difference p{z, y) — p{z)p{y) 
often vanish when approaches infinity. Applying the assumption to the left-hand 
side of Eq.(31) allows us to express Eq.(30) as 

^ II ^ II 

Then, the estimator ^^(z, y) is obtained as a solution of the following minimization 
problem: 


'gii = argmin 

91 


1 

2 


9t{z,yfdzdy + 


f{z,y) 


d9i{z,y) 

dz^^'> 


dzdy 


(33) 


The solution of Eq.(33) depends on the choice of the model. Eet us employ the 
following linear-in-parameter model as gi^z, y): 


gi{z,y) = 6 ]ip^{z,y), 


(34) 


where 6i is a parameter vector and (fi^{z, y) is a basis function vector whose practical 
choice will be discussed later in detail. Eor this model, finding the solution of Eq.(33) 
is equivalent to solving 


min 

et 


-Oj+ 9^ hi 


iT 1 


(35) 


where we define 


Hi 

hi 


9^e{z,y)cfi^{z,yfdzdy, 


(36) 


P{z)p{y) dzdy. 


(37) 
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By approximating the expectation over the densities p{z, y), p{z), and p{y) with sam¬ 
ple averages, we obtain the following empirical minimization problem: 


min 

0e 


-ejHeO, 


+ Oj hi 


where is the sample approximation of Eq.(37): 


(38) 


1 d(pi{zi, y^) 1 d(pi{zi,yj) 

n in? dz?') 


(39) 


i=l id=l 

By including the L 2 regularization term to control the model complexity, we obtain 


0e = argmin 
e, 


-9jH,0e 


0jhi + ^0j0i 


(40) 


where > 0 denotes the regularization parameter. This minimization problem is 
convex w.r.t. the parameter 0i, and the solution can be obtained analytically as 

0e = ~ + Xgl) ^ hg, (41) 


where I denotes the identity matrix. Finally, the estimator of the derivative of the 
density difference is obtained by substituting the solution into the model Eq.(34) as 


ggiz,y) = 0^ ipg{z,y). 


(42) 


Using this solution, an estimator of the derivative of QMI can be directly obtained 
by substituting Eq.(42) into Eq.(27) as 


aQm(w) 

dWg,g> 


1 


n 




-T 


X. 


in 


i=l 




(43) 


We call this method the least-squares QMI derivative (ESQMID). 


4.4 Basis Function Design 

As basis function (^g{z, y), we propose to use 


‘•Pii.z.y) 



where h < n. First, let us define the fc-th Gaussian function as 




z, y) = exp 


Uk\ 


\y - Vk\ 


2aj 


(44) 


where Uk and Vk denote Gaussian centers chosen randomly from the data samples 
{zi, and ag denotes the Gaussian width. We may use different Gaussian widths 

for z and y, but this approach significantly increases the computation time for model 
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selection which will be discussed in Section 4.5. In our implementation, we standardize 
each dimension of x and y to have unit variance and zero mean, and then use the com¬ 
mon Gaussian width for both 2 and y. We also set b = min(n, 200) in the experiments. 

Based on the above Gaussian function, we propose to use the following function as 
the fc-th basis for the £-th model of the derivative of the density difference: 



(45) 


This function is the derivative of the k-th Gaussian basis function w.r.t. . A benefit 


of this basis function design is that the integral appeared in Hi can be computed ana¬ 
lytically. Through some simple calculation, we obtain the {k, k')-th element of as 
follows: 



As discussed in Section 5, this basis function choice has further benefits when we 
develop a supervised dimension reduction method. 

4.5 Model Selection by Cross-Validation 

The practical performance of the LSQMID method depends on the choice of the Gaus¬ 
sian width ai and the regularization parameter included in the estimator 'gi{z, y). 
These tuning parameters can be objectively chosen by the iT-fold cross-validation (CV) 
procedure which is described below. 

1. Divide the training data V = {(ccj, ?/i)})Li into K disjoint subsets with 

approximately the same size. In the experiments, we choose K = 5. 

2. For each candidate M = (d^, A^) and each subset Vj, compute a solution 0£,M,\j 
by Eq.(41) with the candidate M and samples from V\Vj (i.e., all data samples 
except samples in Vj). 

3. Compute the CV score of each candidate pair M by 



where denotes computed from the candidate M and samples in Vj. 

4. Choose the tuning parameter pair such that it minimizes the CV score as 


(d^, A^) = argminCV£(M). 


M 
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5 Supervised Dimension Reduction via LSQMID 

In this section, we propose a supervised dimension reduction method based on the pro¬ 
posed LSQMID estimator. 

5.1 Gradient ascent via LSQMID 

Recall that our goal in supervised dimension reduction is to find the matrix W*: 

W* = argmax QMI(Z,y). (46) 

We{W\WW^=IaJ 

A straightforward approach to find a solution of Eq.(46) using the proposed method is 
to perform gradient ascent as 


W ^W + t 


9(^1 (IV) 


(47) 


where t > 0 denotes the step size. It is known that choosing a good step size is a 
difficult task in practice (Nocedal and Wright, 2006). Line search is an algorithm to 
choose a good step size by finding a step size which satisfies certain conditions such 
as the Armijo rule (Armijo, 1966). However, these conditions often require access to 
the objective value QMI( W) which is unavailable in our current setup since the QMI 
derivative is directly estimated without estimating QMI. Thus, if we want to perform 
line search, QMI needs to be estimated separately. However, this is problematic since 
the estimation of the derivative of the QMI and the estimation of the QMI are performed 
independently without regard to the other, and thus they may not be consistent. For 

example, the gradient ,which is supposed to be an ascent direction, may be 

regarded as a descent direction on the surface of the estimated QMI. For such a case, 
the step size chosen by any line search algorithm is unreliable and the resulting W may 
not be a good solution. 

Below, we consider two approaches which can cope with this problem. 


5.2 QMI Approximation via LSQMID 


To avoid separate QMI estimation, we consider an approximated QMI which is obtained 
as a by-product of the proposed method. Recall that the proposed method models the 
derivative of the density difference as 


df{z,y) 


~ 9i{z,y) 

= G]iPi{z,y) 


d{ej(j)^{z,y)) 

dzW 
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(48) 


This means that the density difference can be approximated by 

feiz,y) = dlct)^{z,y) + ci, 


where q is an unknown quantity which is a constant w.r.t. 

In a special case where = 1, we can use Eq.(48) to obtain a proper approximator 
of QMI(Z, y) in a similar fashion to the LSQMI method. To verify this, let us substitute 
Eq.(48) into one of the f{z, y) in Eq.(15) to obtain 


QMI(Z,y) = i 

_ 1 
“ 2 
_ 1 
“ 2 
_ 1 
“ 2 


f{z,y)J{z,y)dzdy 

f{z, y) y) + c) dzdy 

fiz.y)^^ct){z,y)([z([y + ^ JJ f{z,y)cdzdy 

f{z,y)d^(f){z,y)dzdy, 


where the last line follows from 


f{z,y)cdzdy = 

= 0 . 


p{z, y)cdzdy - p{z)p{y)cdzdy 


By approximating the expectation with sample averages, we obtain a QMI approximator 
as 

1 n T 2 ^ 

QMI(Z,y) = —^d^(l){zi,yi) - ^ ^d^(l>{zi,yj)- (49) 

2 = 1 ^J = l 

The main advantage of using QMI(Z, Y) is that it is obtained from the derivative 
estimation, and thus should be consistent with the estimated derivative. This allows us 
to perform line search for the gradient ascent in a consistent manner. We may further 
improve the optimization procedure by considering an optimization problem over the 
Grassmann manifold: 

W* = argmax (^(Z, Y), (50) 

Uz 

where Gr is defined as 

Gr;^^ := {W G | WW^ = ~ . 

That is, Gr^^ is a set of dz-hy-d^ orthonormal matrices whose rows span the same 
subspace. This manifold optimization is more efficient than the original optimization 
since every step of the optimization always satisfies the orthonormal constraint, and we 
no longer need to perform orthonormalization. More details of manifold optimization 
can be found in Absil et al. (2008). 

Although the QMI approximation in Eq.(49) allows us to choose step size by line 
search in a consistent manner, such an approximation is unavailable when > 1. Next, 
we consider an alternative optimization strategy which does not require an access to the 
QMI value. 
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5.3 Fixed-Point Iteration 


To avoid the problem of ehoosing the step size whieh requires an aeeess to the QMI 
value, we propose to use a fixed-point iteration for finding a solution of Eq.(46). Note 
that from the first order optimality eondition, a solution W* is a stationary point whieh 
satisfies 


dW 


= 0 


dz ■,dx 


where 0^^ denotes dz-by-d^ zero matrix. By using the proposed basis funetion in 
Eq.(45), Eq.(43) ean be expressed as 




dW, 




= 


T/T/ fpdF) 


(51) 


where we define 

,{£,£') 




sn 


2 = 1 
n 


*,i=i 


2=1 ^J = l 




*j=i 


2=1 


with be the column vector of length b consisting of the £-th dimension over all 
and the symbol © represents the element-wise vector product. Then, an approximated 
solution may be obtained by finding IT^ £/ for all (£, i.') such that the left-hand side of 
Eq.(51) is zero. This optimization strategy results in a fixed-point iteration for each 
dimension of W : 

r'dF) _ -pdF) 

Einally, we orthonormalize the solution after each iteration as 

w ^ {ww^y^ w. 


In practice, we perform this orthonormalization only every several iterations for com¬ 
putational efficiency. 

Note that the optimization problem in Eq.(46) is non-convex and may have many 
local solutions. To avoid obtaining a poor local optimal solution, we perform the opti¬ 
mization starting from several initial guesses and choose the solution which gives the 
maximum estimated QMI as the final solution. 


6 Experiments 

In this section, we demonstrate the usefulness of the proposed method through experi¬ 
ments. 
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6.1 Illustrative Experiment 

Firstly, we perform the following experiment to illustrate the usefulness of the proposed 
method in term of the QMI derivative estimation. Let Ar(/x, S) denotes the Gaussian 
distribution with mean /x and eovarianee S. Then, for e ~ AA(0, 0.15^), we generate a 
dataset {{xi, and a matrix W as follows: 

X ~ Af{02,12), 

y = + e, 

W = [cos6' sin6'] , 

where O 2 denotes a zero vector of length 2. Thus we have z = cos 9 + sin 6*. 
The goal is to estimate 

dQMl{Z, Y) dQMl{Z, Y) dW 

W “ aW W 

at different value of 9. Note that QMI(Z, Y) is maximized at 9 = 0, i.e., W = [l O]. 

Figure 1(a) shows the averaged value over 20 experiment trials of the estimated 
QMI(Z, Y) by LSQMI. The vertical axis indicates the value of the estimated QMI and 
the horizontal axis indicates value of 0 G [—f , f] • We use n = 3000 and n = 100 for 
estimating QMI and denote the results by LSQMI(3000) and LSQMI(IOO), respectively. 
We perform cross validation at 0 = 0 and use the chosen tuning parameters for all values 
of 9. The result shows that LSQMI accurately estimates QMI(Z, Y) when the sample 
size is large. However, when the sample size is small, the estimated QMI(Z, Y) has 
high fluctuation. 

Figure 1(b) shows the averaged value over 20 experiment trials of the derivative 
of QMI(Z, y) w.r.t. 9 computed by LSQMI(3000), LSQMI(IOO), and the proposed 
method with n = 100 which is denoted by LSQMID(IOO). For the proposed method, 
we perform cross validation at 6* = 0 and use the chosen tuning parameters for all 
values of 9. The result shows that LSQMID(IOO) gives a smoother estimate than 
LSQMI(IOO) which has high fluctuation. To further explain the cause of the fluctua¬ 
tion of LSQMI(IOO), we plot experiment results of 4 trials in Figure 2, where the left 
column corresponds to the value of the estimated QMI(Z, Y) while the right column 
corresponds to the value of the estimated derivative of QMI(Z, Y) w.r.t. 9. These results 
show that for LSQMI(IOO), a small fluctuation in the estimated QMI can cause a large 
fluctuation in the estimated derivative of QMI. On the other hand, LSQMID directly 
estimates the derivative of QMI and thus does not suffer from this problem. 

6.2 Artificial Datasets 

Next, we evaluate the usefulness of the proposed method in supervised dimension re¬ 
duction using artificial datasets. Firstly, let U(a, b) denote the uniform distribution over 
an interval [a,b], T{a,b) denote the gamma distribution with shape parameter a and 
scale parameter b, and Laplace(a, b) denote the Laplace distribution with mean a and 
scale parameter b. Then we consider the input x with dx = 5, the output y with dy = 1, 
and the optimal matrix Wopt (including their rotations) as follows: 
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(a) The averaged estimated QMI. 



(b) The averaged estimated derivative of QMI. 


Figure 1: The mean and standard error of the estimated QMI(Z, Y) and the estimated 
derivative of QMI(Z, Y) w.r.t. 9 over 20 experiment trials. 


Dataset A: For e ~ r(0.25,0.25), we use 

X ~ A/'(05,/5), 


y = exp(-—-) + e, 


VFopt = 


0.5 

7S 75 0 0 0 


Dataset B: For e ~ 7(0.25,0.5) and i e {1,..., 5}, we use 


xW ~ U(-l,l), 
z = + 2a;^^^), 

y = zsm.{z) — e, 


Wopt = 


75 * 0 0 0 


Dataset C: For e ~ 7(0.25,0.5) and i G {1,..., 5}, we use 

~ U(-l,l), 

y = — e, 

v2 


W"opt = 


1 o o o o 
O 1 o o o 


Dataset D: For e ~ AA(0,0.25) and i G {1,..., 5}, we use 

~ Laplace(0, 0.5), 


y = smc( 


+ x*'^^e. 


VFopt = 


1 o o o o 
O 1 o o o 
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(a) The estimated QMI. 


(b) The estimated derivative of QMI. 


Figure 2: Examples of the estimated QMI and the estimated derivative of QMI. The 
left column shows the estimated QMI(Z, Y), and the right column shows the estimated 

derivative of QMI(Z, Y) w.r.t. 9. Each row indicates each experiment trial. 
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(c) Dataset C with n = 400 (d) Dataset D with n = 500 


Figure 3: Artificial datasets. 


For the datasets A, B, and C, e is an additive gamma noise, while for the datasets D, 
e is a multiplicative Gaussian noise. Figure 3 shows the plot of these datasets (after 
standardization). Note the presence of outliers in the datasets. 

To estimate W from {{xi, we execute the following methods: 

LSQMID: The proposed method. Supervised dimension reduction is performed by 
maximizing QMI(Z, Y) where the derivative of QMI(Z, Y) is estimated by the 
proposed method. The solution W is obtained by fixed-point iteration. 

LSQMI: Supervised dimension reduction is performed by maximizing QMI(Z, Y) 
where QMI(Z, Y) is estimated by LSQMI and the derivative of QMI(Z, Y) 
w.r.t. W is computed from the QMI estimator. The solution W is obtained by 
gradient ascent with linesearch over the Grassmann manifold ^. 

LSDR (Suzuki and Sugiyama, 2013): Supervised dimension reduction is performed 
by maximizing SMI(Z, Y). The solution W is obtained by gradient ascent with 

^We use the manifold optimization toolbox (Boumal et al., 2014) to perform the optimization. 


24 







linesearch over the Grassmann manifold 

dMAVE (Xia, 2007): Supervised dimension reduction is performed by minimizing an 
error of the local linear smoother of the conditional density p{y\z). The solution 
W is obtained by alternatively solving quadratic programming problems ^. 

KDR (Fukumizu et ah, 2009): Supervised dimension reduction is performed by mini¬ 
mizing the trace of the conditional covariance operator Tjyy\z- The solution W 
is obtained by gradient descent with linesearch over the Stiefel manifold 

For methods which require initial solutions, i.e., LSQMID, LSQMI, and LSDR, we 
randomly generate 10 orthonormal matrices and use them as the initial solutions. For 
dMAVE and KDR, we use a solution obtained by dOPG and gKDR, respectively, as 
the initial solution. Finally, the obtained solution W is evaluated by the dimension 
reduction error defined as 

Errorjjf^ 11 WW opt FE W11 Frobenius ■} 

where || • || Frobenius dcnotcs the Frobenius norm of a matrix. 

Table 1 shows the mean and standard error over 30 experiment trials of the dimen¬ 
sion reduction error on the artificial datasets with different sample size. The results 
show that the proposed method works well overall. LSDR performs well especially for 
dataset A and B. KDR also performs well overall. However, its performance is quite 
unstable for dataset B, which can be seen by relatively large standard errors. This is 
because gKDR might provide a poor initial solution to KDR in some experiment trials, 
which makes KDR fails to find a good solution. 

On the other hand, both LSQMI and dMAVE do not perform well overall. LSQMI 
tends to be unstable and works very poorly especially when the sample size is small, 
except for dataset D. The cause of this failure could be the high fluctuation of the deriva¬ 
tive of QMI by LSQMI, as shown previously in the illustrative experiment. Although 
the solution of dMAVE is quite stable, its performance is not overall comparable to the 
other methods. This is because the model selection strategy in dMAVE did not perform 
well for these datasets. 

6.3 Benchmark Datasets 

Linally, we evaluate the proposed method in supervised dimension reduction on UCI 
benchmark datasets (Bache and Lichman, 2013). Lor all datasets, we append the origi¬ 
nal input X with noise features of dimensionality 5. More specifically, for the original 
input X with dimensionality dy^, we consider the augmented input x with dimensionality 
dy = dy + 5 as 

X = 71,72,73,74,75], 

^We use the program code: http : / /www. ms . k . u-tokyo . ac . jp/software . html#LSDR 
^We use the program code: http : / /www .stat.nus.edu. sg/ ~staxyc/ 

®We use the program code: http : / /www. ism. ac . jp / - fukumizu/software . html 


25 



Table 1: Mean and standard error of the dimension reduetion error over 30 trials for artifieial datasets. The best method in term of the mean 
error and eomparable methods aeeording to the paired t-test at the signifieanee level 5% are speeified by bold faee. 


Dataset 

n 

LSQMID 

LSQMI 

LSDR 

dMAVE 

KDR 

A 

100 

200 

0.126(0.039) 

0.046(0.006) 

0.394(0.094) 

0.077(0.035) 

0.073(0.007) 

0.046(0.005) 

0.123(0.009) 

0.082(0.007) 

0.077(0.008) 

0.048(0.006) 

6 

100 

200 

0.079(0.009) 

0.041(0.004) 

0.488(0.097) 

0.174(0.064) 

0.072(0.006) 

0.038(0.003) 

0.128(0.010) 

0.078(0.006) 

0.306(0.088) 

0.099(0.045) 

C 

200 

400 

0.192(0.026) 

0.084(0.005) 

0.633(0.100) 

0.108(0.011) 

0.203(0.011) 

0.128(0.006) 

0.146(0.010) 

0.105(0.006) 

0.155(0.010) 

0.090(0.007) 

D 

300 

500 

0.245(0.050) 

0.128(0.017) 

0.301(0.054) 

0.129(0.013) 

0.286(0.032) 

0.185(0.015) 

0.370(0.046) 

0.263(0.038) 

0.257(0.025) 

0.198(0.013) 


Table 2: Mean and standard error of the root mean squared error over 30 trials for benehmark datasets. The best method in term of the mean 
error and eomparable methods aeeording to the paired t-test at the signifieanee level 5% are speeified by bold faee. 


Dataset 

ntr 

dx 


LSQMID 

LSQMI 

LSDR 

dMAVE 

KDR 

Fertility 

50 

14 

1 

2 

3 

4 

1.215(0.049) 

1.051(0.045) 

1.052(0.044) 

1.046(0.042) 

1.092(0.043) 

1.029(0.043) 

1.038(0.047) 

1.026(0.042) 

1.315(0.043) 

1.199(0.031) 

1.104(0.044) 

1.092(0.039) 

1.321(0.063) 

1.340(0.052) 

1.288(0.048) 

1.271(0.033) 

1.116(0.050) 

1.104(0.044) 

1.121(0.043) 

1.146(0.044) 

Yacht 

100 

11 

1 

2 

3 

4 

0.120(0.005) 

0.154(0.011) 

0.314(0.024) 

0.413(0.021) 

0.546(0.042) 

0.675(0.047) 

0.690(0.037) 

0.732(0.043) 

0.180(0.012) 

0.344(0.023) 

0.425(0.018) 

0.494(0.015) 

0.213(0.017) 

0.224(0.014) 

0.265(0.013) 

0.352(0.017) 

0.124(0.007) 

0.278(0.033) 

0.353(0.028) 

0.399(0.012) 

Concrete 

200 

13 

1 

2 

3 

4 

0.621(0.013) 

0.568(0.010) 

0.557(0.009) 

0.545(0.012) 

0.606(0.014) 

0.591(0.009) 

0.579(0.011) 

0.667(0.025) 

0.606(0.008) 

0.568(0.010) 

0.576(0.012) 

0.568(0.010) 

0.582(0.006) 

0.529(0.009) 

0.539(0.007) 

0.540(0.008) 

0.791(0.030) 

0.614(0.025) 

0.579(0.016) 

0.571(0.014) 

Breast-cancer 

200 

15 

1 

2 

3 

4 

0.447(0.011) 

0.435(0.010) 

0.376(0.004) 

0.377(0.005) 

0.523(0.018) 

0.473(0.012) 

0.462(0.010) 

0.419(0.008) 

0.442(0.010) 

0.437(0.009) 

0.431(0.007) 

0.436(0.007) 

0.375(0.007) 

0.420(0.012) 

0.426(0.008) 

0.426(0.011) 

0.447(0.012) 

0.454(0.014) 

0.430(0.007) 

0.433(0.007) 

Bike 

300 

19 

1 

2 

3 

4 

0.043(0.011) 

0.036(0.005) 

0.037(0.005) 

0.060(0.006) 

0.070(0.019) 

0.035(0.003) 

0.032(0.003) 

0.051(0.007) 

0.016(0.001) 

0.049(0.002) 

0.065(0.002) 

0.077(0.002) 

0.139(0.051) 

0.081(0.007) 

0.086(0.008) 

0.071(0.005) 

0.513(0.059) 

0.291(0.050) 

0.243(0.037) 

0.213(0.029) 





























where 7 ^ ~ r(l,2) for i e {1,... ,5}. Then we use the paired data {($i, 
to perform experiments. We randomly ehoose ntr samples for training purposes, and 
use the rest rite = n — ^tr for testing purposes. We exeeute the supervised dimension 
reduction methods with target dimensionality € {1, 2, 3,4} to obtain solutions W. 
Then we train a kernel ridge regressor y = f{Wx) with the Gaussian kernel where 
the tuning parameters are chosen by 5-fold cross-validation. Finally, we evaluate the 
regressor by the root mean squared error (RMSE): 



Table 2 shows the RMSE averaged over 30 trials for the benchmark experiments. It 
shows that the proposed method performs well overall on all datasets. ESQMI performs 
very well for the ‘Eertility’ and ‘Bike’ datasets, but its performance is quite poor for the 
other datasets. In contrast, dMAVE performs very well especially for the ‘Concrete’ 
dataset where it gives the best solutions for all value of d^- However, its performance is 
quite poor for the ‘Eertility’ and ‘Bike’ datasets. Both ESDR and KDR do not perform 
well on these datasets. 

7 Conclusion 

We proposed a novel supervised dimension reduction method based on efficient maxi¬ 
mization of quadratic mutual information (QMI). Our key idea was to directly estimate 
the derivative of QMI without estimating QMI itself. We firstly developed a method to 
directly estimate the derivative of QMI, and then developed fixed-point iteration which 
efficiently uses the derivative estimator to find a maximizer of QMI. In addition to 
the robustness against outliers thanks to the property of QMI, the proposed method is 
widely applicable because it does not require any assumption on the data distribution 
and tuning parameters can be objectively chosen via cross-validation. The experiment 
results on artificial and benchmark datasets showed that the proposed method is promis¬ 
ing. 
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